[Day29] Web Scraping in Practice - iThome Article Titles

2022 iThome Ironman Contest

DAY 29

AI & Data

Part 29 of the series 30天帶你從零基礎到Python爬蟲 (30 Days from Zero to Python Web Scraping)

14th Ironman Contest

霓霓

2022-09-29 00:19:00

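Today we are once again scraping URLs and related information from a target site, but this is my favorite post of the series because I get to scrape my own articles. Scraping data about myself is especially exciting ヽ(✿ﾟ▽ﾟ)ノ. The site for today's demo is my Ironman Contest series from last year.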

For each article I want to print the title, the view count, the number of likes and comments, and finally the upload time and the URL.

Confirming the URL we actually need

The first step is to find the real URL. When I switched to page 2, the URL in the address bar changed, so there is no need to hunt for a hidden URL.
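
Because the page number lives in the query string, the URL for any page can be built directly. Here is a minimal sketch, assuming the `?page=N` pattern seen in the address bar holds for every page; `page_url` is a hypothetical helper, and the page count of 3 is only an illustration, not the real length of the series listing.

```python
from urllib.parse import urlencode

# Series URL used in this post, without the query string.
BASE_URL = "https://ithelp.ithome.com.tw/users/20140998/ironman/4362"


def page_url(page: int) -> str:
    """Build the URL of one listing page (hypothetical helper)."""
    return f"{BASE_URL}?{urlencode({'page': page})}"


# Assumption: 3 pages, chosen only for illustration.
urls = [page_url(p) for p in range(1, 4)]
for u in urls:
    print(u)
```

Each of these URLs could then be passed to requests in the same way as the single-page example in this post.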

Writing the code

The next steps should be familiar by now: import the two libraries (requests & BeautifulSoup), fetch the page with get(), then parse the HTML with bs4.

import requests
from bs4 import BeautifulSoup

url = "https://ithelp.ithome.com.tw/users/20140998/ironman/4362?page=1"
response = requests.get(url)
html = BeautifulSoup(response.text, "html.parser")
print(html)

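This time is different, though. If you print html, you will not see a normal page; you get 403 Forbidden instead. The cause is the User-Agent header. By default requests identifies itself as python-requests/<version> rather than as a browser. Some sites are lenient and let that through, but iThome seems to be a strict one.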

<html>
<head><title>403 Forbidden</title></head>
<body>
<center><h1>403 Forbidden</h1></center>
</body>
</html>
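
A blocked request like this is easier to spot from the status code than from the printed HTML. The sketch below is one way to guard the parsing step. `body_or_error` is a hypothetical helper, and the `SimpleNamespace` object is only a stand-in for a real response so the example runs without network access; it relies on nothing more than the `status_code` and `text` attributes that requests responses provide.

```python
from types import SimpleNamespace


def body_or_error(response):
    """Return the page body, or raise if the server did not answer 200."""
    if response.status_code != 200:
        raise RuntimeError(f"Request failed with HTTP {response.status_code}")
    return response.text


# Stand-in for the blocked response shown above (not a real request).
blocked = SimpleNamespace(status_code=403, text="<h1>403 Forbidden</h1>")

try:
    body_or_error(blocked)
except RuntimeError as err:
    print(err)
```

With a real response, calling `response.raise_for_status()` is the built-in way to get similar behavior.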

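There is no need to worry when this happens, though: just send a full browser User-Agent header! Where do you find one? Press F12 to open the developer tools, select Network at the top, click any request, select Headers, scroll to the bottom to find user-agent, and copy the long string after it.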

import requests
from bs4 import BeautifulSoup

url = "https://ithelp.ithome.com.tw/users/20140998/ironman/4362?page=1"
response = requests.get(url, headers={"user-agent":"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36"})
html = BeautifulSoup(response.text, "html.parser")
print(html)

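From here the flow is about the same as usual: find the block that wraps each article, loop over every entry, and pull out the information you need. The one thing to watch is that likes, comments, and views all share the same class name, so you have to fetch all of them first and then pick each one by index ([index]) when printing.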

article = html.find_all("div", {"class":"qa-list"})  # one block per article

for art in article:
    title = art.find("a", {"class":"qa-list__title-link"})  # title
    views = art.find_all("span", {"class":"qa-condition__count"})  # likes, comments, views
    date = art.find("a", {"class":"qa-list__info-time"})  # upload date
    print("Title:", title.text.strip())
    print("Views:", views[2].text)
    print("Likes:", views[0].text)
    print("Comments:", views[1].text)
    print("Uploaded:", date.text)
    print("URL:", title["href"].strip())
    print("-" * 30)  # separator
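
The index-based lookup can be tried offline on a small hand-written snippet. This is only a sketch: the `SAMPLE` markup is invented for illustration and reuses just the class names from this post, `parse_article` is a hypothetical helper, and skipping incomplete blocks is my own addition rather than something the real page is known to need.

```python
from bs4 import BeautifulSoup

# Invented markup that mimics the class names used in this post.
SAMPLE = """
<div class="qa-list">
  <span class="qa-condition__count">3</span>
  <span class="qa-condition__count">1</span>
  <span class="qa-condition__count">250</span>
  <a class="qa-list__title-link" href="https://example.com/articles/1"> Sample title </a>
  <a class="qa-list__info-time">2021-09-01</a>
</div>
<div class="qa-list">
  <span class="qa-condition__count">0</span>
</div>
"""


def parse_article(art):
    """Return one article as a dict, or None if any expected part is missing."""
    title = art.find("a", {"class": "qa-list__title-link"})
    counts = art.find_all("span", {"class": "qa-condition__count"})
    date = art.find("a", {"class": "qa-list__info-time"})
    if title is None or date is None or len(counts) < 3:
        return None
    return {
        "title": title.text.strip(),
        "likes": counts[0].text.strip(),     # first span with the shared class
        "comments": counts[1].text.strip(),  # second span
        "views": counts[2].text.strip(),     # third span
        "date": date.text.strip(),
        "url": title["href"].strip(),
    }


soup = BeautifulSoup(SAMPLE, "html.parser")
parsed = [parse_article(a) for a in soup.find_all("div", {"class": "qa-list"})]
rows = [r for r in parsed if r is not None]  # drop incomplete blocks
print(rows)
```

Returning a dict instead of printing right away keeps the order of the three counts in one place, which makes it harder to mix up likes, comments, and views.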

Complete code

import requests
from bs4 import BeautifulSoup

url = "https://ithelp.ithome.com.tw/users/20140998/ironman/4362?page=1"
response = requests.get(url, headers={"user-agent":"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36"})
html = BeautifulSoup(response.text, "html.parser")
article = html.find_all("div", {"class":"qa-list"})

for art in article:
    title = art.find("a", {"class":"qa-list__title-link"})  # title
    views = art.find_all("span", {"class":"qa-condition__count"})  # likes, comments, views
    date = art.find("a", {"class":"qa-list__info-time"})  # upload date
    print("Title:", title.text.strip())
    print("Views:", views[2].text)
    print("Likes:", views[0].text)
    print("Comments:", views[1].text)
    print("Uploaded:", date.text)
    print("URL:", title["href"].strip())
    print("-" * 30)  # separator